Productionizing Distributed XGBoost to Train Deep Tree Models with Large Data Sets at Uber
XGBoost on Apache Spark treats zero values in a SparseVector as missing values (the missing-value marker defaults to Float.NaN), whereas zeros in a DenseVector are treated as actual zeros. Vector storage in Apache Spark ML is implicitly optimized, so the same array of values may be stored as either a SparseVector or a DenseVector, depending on which is more space efficient. If an ML practitioner feeds a DenseVector at inference time to a model that was trained on SparseVectors, or vice versa, XGBoost gives no warning. Because a zero is read as a real value in one representation and as missing in the other, the same input can follow different tree branches, resulting in inconsistent predictions. Hence, it is critical that the vector storage format of the input stays consistent between training and serving.
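The effect can be illustrated with a small, self-contained sketch. This is a toy model in plain Python, not the XGBoost or Spark API: the helpers `to_dense`, `to_sparse`, and `predict_stump`, the threshold, and the default direction for missing values are all hypothetical and chosen only to show how one row can reach two different leaves depending on how its zeros are stored.

```python
import math


def to_dense(values):
    """Dense storage: every entry, including zeros, is kept explicitly."""
    return list(values)


def to_sparse(values):
    """Sparse storage: only non-zero entries are kept, as {index: value}."""
    return {i: v for i, v in enumerate(values) if v != 0.0}


def feature_from_dense(vec, index):
    # A stored zero is read back as a real 0.0.
    return vec[index]


def feature_from_sparse(vec, index, missing=math.nan):
    # An index absent from sparse storage is read as the missing marker.
    return vec.get(index, missing)


def predict_stump(value, threshold=0.5, left=-1.0, right=1.0, default_left=False):
    """A single split: values below the threshold go left, others go right.

    Missing values (NaN) skip the comparison and follow the default
    direction, which is assumed here to be the right branch.
    """
    if math.isnan(value):
        return left if default_left else right
    return left if value < threshold else right


# The same logical row, stored two ways. Feature 0 is zero.
row = [0.0, 3.2]

dense_pred = predict_stump(feature_from_dense(to_dense(row), 0))
sparse_pred = predict_stump(feature_from_sparse(to_sparse(row), 0))

print(dense_pred, sparse_pred)  # -1.0 1.0
```

In the dense case the zero is compared against the threshold and goes left; in the sparse case it is absent, read as missing, and follows the default branch. In Spark itself, one way to avoid the mismatch is to convert every feature vector to a single representation (for example with the `toDense` or `toSparse` methods on `Vector`) before both training and inference, though whether that is appropriate depends on the data and on how missing values should be interpreted.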